Supporting Concept Extraction and Identifier Quality Improvement through Programmers' Lexicon Analysis

Author

  • Surafel Lemma Abebe
Abstract

Identifiers play an important role in communicating the intentions associated with the program entities they represent. The information captured in identifiers helps programmers (re-)build the “mental model” of the software and facilitates understanding. (Re-)building the “mental model” and understanding large software, however, is difficult and expensive. Moreover, the effort involved depends heavily on the quality of the programmers’ lexicon used to construct the identifiers. This thesis addresses the problem of program understanding, focusing on (i) concept extraction and (ii) the quality of the lexicon used in identifiers. To address the first problem (concept extraction), two ontology extraction approaches that exploit the natural language information captured in identifiers and the structural information of the source code are proposed and evaluated. We have also proposed a method to automatically train a natural language analyzer for identifiers; the trained analyzer is used for concept extraction. The evaluation was conducted on a program understanding task, concept location. Results show that the extracted concepts increase the effectiveness of concept location queries. Besides extracting concepts from the source code, we have investigated information retrieval (IR) based techniques to filter domain concepts from implementation concepts. To address the second problem (quality of the lexicon used in identifiers), we have defined a publicly available catalog of lexicon bad smells (LBS) and developed a suite of tools to detect them automatically. LBS indicate potential lexicon construction problems that can be addressed through refactoring. The impact of LBS on concept location and their contribution to fault prediction have been studied empirically. Results indicate that LBS refactoring has a significant positive impact on the IR-based concept location task and helps improve fault prediction when used in conjunction with structural metrics. In addition to detecting LBS in identifiers, we also try to fix them. We have proposed an approach that uses the concepts extracted from the source code to suggest names that can be used to complete or replace an identifier. The evaluation of the approach shows that it provides useful suggestions, which can effectively support programmers in writing consistent names.


Similar resources

Feature extraction in opinion mining through Persian reviews

Opinion mining deals with the analysis of user reviews to extract their opinions, sentiments and demands in a specific area, which can play an important role in making major decisions in that area. In general, opinion mining analyzes user reviews at three levels: document, sentence and feature. Opinion mining at the feature level is taken into consideration more than the other two levels d...


Improving Corpus Comparability for Bilingual Lexicon Extraction from Comparable Corpora

Previous work on bilingual lexicon extraction from comparable corpora aimed at finding a good representation for the usage patterns of source and target words and at comparing these patterns efficiently. In this paper, we try to work it out in another way: improving the quality of the comparable corpus from which the bilingual lexicon has to be extracted. To do so, we propose a measure of compa...


Extraction de lexiques bilingues à partir de corpus comparables spécialisés : étude du contexte lexical

This work focuses on the concept of lexical context, which is central to the historical approach to bilingual lexicon extraction from specialized comparable corpora. First, we revisit the two main strategies dedicated to lexical context characterization, which rely on window-based and syntax-based representations. We show that the combination of these two representations has a partic...


Models of EFL Learners’ Vocabulary Development: Spreading Activation vs. Hierarchical Network Model

Semantic network approaches view organization or representation of internal lexicon in the form of either spreading or hierarchical system identified, respectively, as Spreading Activation Model (SAM) and Hierarchical Network Model (HNM). However, the validity of either model is amongst the intact issues in the literature which can be studied through basing the instruction compatible wi...


Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction

The main work in bilingual lexicon extraction from comparable corpora is based on the implicit hypothesis that corpora are balanced. However, the historical context-based projection method dedicated to this task is relatively insensitive to the sizes of each part of the comparable corpus. Within this context, we have carried out a study on the influence of unbalanced specialized comparable corpo...




Publication date: 2013